Probabilistic Visitor Stitching on Cross-Device Web Logs

Authors

  • Sungchul Kim
  • Nikhil Kini
  • Jay Pujara
  • Eunyee Koh
  • Lise Getoor
Abstract

Personalization – the customization of experiences, interfaces, and content to individual users – has catalyzed user growth and engagement for many web services. A critical prerequisite to personalization is establishing user identity. However, the variety of devices, including mobile phones, appliances, and smart watches, from which users access web services in both anonymous and logged-in sessions poses a significant obstacle to user identification. The resulting entity resolution task of establishing user identity across devices and sessions is commonly referred to as “visitor stitching.” We introduce a general, probabilistic approach to visitor stitching using features and attributes commonly contained in web logs. Using web logs from two real-world corporate websites, we motivate the need for probabilistic models by quantifying the difficulties posed by noise, ambiguity, and missing information in deployment. Next, we introduce our approach using probabilistic soft logic (PSL), a statistical relational learning framework capable of capturing similarities across many sessions and enforcing transitivity. We present a detailed description of model features and design choices relevant to the visitor stitching problem. Finally, we evaluate our PSL model on binary classification performance for two real-world visitor stitching datasets. Our model demonstrates significantly better performance than several state-of-the-art classifiers, and we show how this advantage results from collective reasoning across sessions.

Similar resources

Probabilistic Deduplication of Anonymous Web Traffic

Cookies and log in-based authentication often provide incomplete data for stitching website visitors across multiple sources, necessitating probabilistic deduplication. We address this challenge by formulating the problem as a binary classification task for pairs of anonymous visitors. We compute visitor proximity vectors by converting categorical variables like IP addresses, product search key...

Probabilistic Relational Models of On-line User Behavior Early Explorations

We propose the usefulness of probabilistic relational methods for modeling user behavior at web sites. Web logs (aka "click streams"), server logs, and other data sources, taken as datasets for traditional machine learning algorithms, violate the iid assumption of most algorithms. Requests ("clicks") are not independent within a session, sessions for a visitor are not independent of one another...

A Comparison of Iranian Libraries' and Librarians' Weblogs with Top Librarianship Weblogs; 1385

Introduction: Web logs are evident tools for librarians. There are three main ways of applying web logs in librarianship: personal use by librarians to upgrade their personal information, as a source of information for libraries, and for library services. The aim of this research is to compare Iranian libraries and librarians, and superior librarianshi...

Adaptive Web Sites: Automatically Synthesizing Web Pages Adaptive Web Sites the Index Page Synthesis Problem the Pagegather Algorithm

Content Areas: data mining, machine learning, applications, user interfaces. Abstract: The creation of a complex web site is a thorny problem in user interface design. In IJCAI '97, we challenged the AI community to address this problem by creating adaptive web sites: sites that automatically improve their organization and presentation by mining visitor access data collected in Web server logs. I...

Adaptive Web Sites: Automatically Synthesizing Web Pages

The creation of a complex web site is a thorny problem in user interface design. In IJCAI ’97, we challenged the AI community to address this problem by creating adaptive web sites: sites that automatically improve their organization and presentation by mining visitor access data collected in Web server logs. In this paper we introduce our own approach to this broad challenge. Specifically, we ...

Publication date: 2017